Abstract

Methods that can evaluate aggregate effects of rare and common variants are limited. Therefore, we applied a 
two-stage approach to evaluate aggregate gene effects in the 1000 Genomes Project data, which contain 24,487 
single-nucleotide polymorphisms (SNPs) in 697 unrelated individuals from 7 populations. In stage 1, we identified 
potentially interesting genes (PIGs) as those having at least one SNP meeting Bonferroni correction using 
univariate, multiple regression models. In stage 2, we evaluated aggregate PIG effects on the trait Q1 by modeling each
gene as a latent construct, which is defined by multiple common and rare variants, using the multivariate statistical
framework of structural equation modeling (SEM). In stage 1, we found that PIGs varied markedly between a
randomly selected replicate (replicate 137) and 100 other replicates, with the exception of FLT1. In stage 1,
collapsing rare variants decreased false positives but increased false negatives. In stage 2, we developed a
good-fitting SEM model that included all nine genes simulated to affect Q1 (FLT1, KDR, ARNT, ELAVL4, FLT4, HIF1A, HIF3A,
VEGFA, VEGFC) and found that FLT1 had the largest effect on Q1 (β_std = 0.33 ± 0.05). Using replicate 137 estimates
as population values, we found that the mean relative bias in the parameters (loadings, paths, residuals) and their
standard errors across 100 replicates was, on average, less than 5%. Our latent variable SEM approach provides a
viable framework for modeling aggregate effects of rare and common variants in multiple genes, but more elegant
methods are needed in stage 1 to minimize type I and type II error.



Background 

The 1000 Genomes Project is an international public- 
private consortium aiming to build the most detailed 
map of human genetic variation with the overarching 
goal to improve our understanding of the genetic contribution
to common human diseases. Since the project was launched in
2008, three pilot studies have been completed to test
multiple sequencing methods. Pilot Project 3 involved
sequencing the coding regions (exons) of 3,205 genes in
697 individuals from 7 populations, which revealed
24,487 rare and common genetic variants. The
sequencing data from Pilot Project 3 were used for
Genetic Analysis Workshop 17 (GAW17), and details of 
this data set, including how the phenotypes were simu- 
lated, can be found in Almasy et al. [1]. 

Although strategies have been developed to evaluate 
the contribution of rare variants to disease susceptibility 
in nonfamilial data, including collapsing methods, which 
are reviewed by Dering et al. [2], approaches that evaluate
the combined or aggregate effects of rare and common
variants together are limited. Thus, in this paper
we aim to evaluate the aggregate effects of rare and
common single-nucleotide polymorphisms (SNPs) in
genes on the simulated quantitative trait Q1 using the
Pilot Project 3 data (unrelated subjects). In stage 1 we
use multiple regression methods (with and without
collapsing rare variants) to identify potentially interesting
genes (PIGs); in stage 2, we use a latent variable
structural equation modeling (SEM) approach to evaluate
aggregate effects of rare and common variants in
PIGs on Q1. During our initial analyses, we were
blinded to the "answers" of the simulated model. In post
hoc analyses, we used knowledge that 39 SNPs in 9
genes, primarily in the vascular endothelial growth factor
(VEGF) pathway, were simulated to be associated
with Q1.

Methods 

Data cleaning and preparation: phenotype and genotype 
variables 

We first examined the distribution of Q1, which we
arbitrarily chose from the three simulated quantitative
phenotypes available (see Almasy et al. [1]), in a randomly
chosen replicate (replicate 137) of the unrelated
individuals from the GAW17 data using SAS, v. 9.1
(SAS Institute Inc., Cary, North Carolina). Visual inspection
of histograms and quantile-quantile (Q-Q) plots, together
with Shapiro-Wilk and Kolmogorov-Smirnov tests, indicated
that Q1 was essentially normally distributed. Summary
statistics and Mendelian inheritance errors were
evaluated using PLINK, v. 1.07 [3].

Stage 1: statistical methods for regression-based analyses 

We evaluated the association between each SNP, coded
additively (0, 1, or 2 copies of the minor allele),
and Q1 using linear regression models adjusted for all
the covariates provided in the GAW17 data set (Age,
Sex, Smoking, population [Pop1]) using PLINK, v.
1.07. In addition, we collapsed rare variants (minor
allele frequency [MAF] < 0.05) in each gene using the
indicator coding method [2], which assumes equal
weighting of each rare variant. We also adjusted the
models for population substructure using principal
components (PCs). PCs were generated using the centralized
scoring matrix method of Qin et al. [4,5] in
MATLAB (MathWorks, Boston, Massachusetts). We
adjusted models for multiple PCs and found that
adjusting for 10 or 12 PCs minimized the number of
false positives (see Results section, Table 3, and
additional details in Qin et al. [5]).
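
The two stage 1 ingredients described above, indicator coding of rare variants and an additive-model linear regression adjusted for covariates, can be illustrated with a minimal sketch. This is not the PLINK implementation used in the analysis; the function names (`collapse_rare_variants`, `additive_fit`) and the toy data are assumptions made only for illustration, and ordinary least squares is computed directly with NumPy.

```python
import numpy as np

def collapse_rare_variants(genotypes, maf_threshold=0.05):
    """Indicator coding: 1 if a subject carries any rare minor allele in the gene, else 0.

    genotypes: (n_subjects, n_snps) array of minor-allele counts (0, 1, or 2).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    maf = genotypes.mean(axis=0) / 2.0           # allele frequency from allele counts
    rare = maf < maf_threshold                   # SNPs treated as rare
    return (genotypes[:, rare].sum(axis=1) > 0).astype(int)

def additive_fit(trait, predictor, covariates):
    """OLS of trait on intercept + predictor + covariates; returns (beta, se) for the predictor."""
    trait = np.asarray(trait, dtype=float)
    X = np.column_stack([np.ones(len(trait)), predictor, covariates])
    coef, _, _, _ = np.linalg.lstsq(X, trait, rcond=None)
    resid = trait - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
    return coef[1], np.sqrt(cov[1, 1])

# Hypothetical toy gene: 12 subjects, 3 SNPs (columns 0 and 2 are rare, column 1 is common).
g = np.zeros((12, 3))
g[2, 0] = 1
g[5, 2] = 1
g[:, 1] = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
collapsed = collapse_rare_variants(g)

# Hypothetical noise-free trait so the additive effect of the common SNP is recovered exactly.
age = np.arange(12) + 30.0
q1 = 2.0 + 0.5 * g[:, 1] + 0.1 * age
beta, se = additive_fit(q1, g[:, 1], age)
```

Because every rare variant contributes equally to the indicator, the collapsed variable reduces the number of tests per gene at the cost of ignoring differences in effect size among the rare variants.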

Stage 2: statistical methods for latent variable structural 
equation modeling 

Our approach for modeling multiple common variants 
in genes using latent constructs has been described pre- 
viously [6]. Essentially, we let a latent variable (ovals in 
Figure 1, e.g., FLT1) represent the overall variation in a 
gene, which we formally describe by multiple SNPs (rec- 
tangles in Figure 1, e.g., C13S522) in that gene. In terms 
of notation, briefly, in latent variable structural equation
modeling (SEM), two general submodels are used: (1) a
measurement model that develops the relations (loadings;
e.g., the arrow from FLT1 to C13S522 in Figure 1)
between the observed variables and the latent constructs;
and (2) a structural model that develops the
relations (path coefficients; e.g., the arrow from PopStr
to FLT1 in Figure 1) between the latent variables. The
general form of the measurement model is:

\[ y = \Lambda_y \eta + \varepsilon, \qquad (1) \]

where $y$ is the $p \times 1$ vector of observed variables, $\eta$ is
the $m \times 1$ vector of latent random variables, $\varepsilon$ is the
$p \times 1$ vector of measurement errors for $y$, and $\Lambda_y$ is the
$p \times m$ matrix of coefficients relating $y$ to $\eta$.

The general form of the structural model imposes
constraints such that:

\[ \eta = B\eta + \zeta, \qquad (2) \]

where $B$ is the $m \times m$ matrix of path coefficients and
$\zeta$ is the $m \times 1$ vector of errors or disturbances in the
endogenous (dependent) latent variables.

The structural model can be modified by adding a q-dimensional
vector of covariates ($x$), an $m \times q$ matrix of
regression coefficients ($\Gamma$), and an m-dimensional vector
of intercepts ($\alpha$):

\[ \eta = \alpha + B\eta + \Gamma x + \zeta. \qquad (3) \]
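
As a worked step that is not spelled out in the text, the structural model with covariates can be solved for the latent variables and substituted into the measurement model. The sketch below writes the latent vector as $\eta$, the loading matrix as $\Lambda_y$, the path matrix as $B$, the covariate coefficients as $\Gamma$, the intercepts as $\alpha$, and the measurement and structural errors as $\varepsilon$ and $\zeta$. It assumes that $(I - B)$ is invertible, which is an assumption of this illustration rather than a statement from the source.

```latex
\begin{align*}
\eta &= \alpha + B\eta + \Gamma x + \zeta \\
(I - B)\,\eta &= \alpha + \Gamma x + \zeta \\
\eta &= (I - B)^{-1}\left(\alpha + \Gamma x + \zeta\right) \\
y &= \Lambda_y (I - B)^{-1}\left(\alpha + \Gamma x + \zeta\right) + \varepsilon
\end{align*}
```

The last line shows how each observed SNP depends on the covariates only through the latent gene constructs, which is what allows covariates and population structure to be modeled at the gene level.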

Similar to our prior work using common variants to 
describe overall variation in a gene [6,7], we used eigen- 
values, scree plots, reliability, linkage disequilibrium 
plots (Haploview, v. 4.2), and association results from 
stage 1 to help select the most informative SNPs and 
define parsimonious latent gene constructs. We per- 
formed confirmatory factor analysis using a robust max- 
imum-likelihood estimator, which provides test statistics 
and standard errors robust to nonnormality, using 
Mplus, v. 5.1 (Muthén and Muthén, Los Angeles, California),
to generate and test single latent gene construct
models, which included adjustment for covariates (Age,
Sex, Smoking) and population structure (modeled using
a latent variable defined by Pop1 and the top 12 PCs).
To assess the overall model goodness-of-fit, we used the 
chi-square test, the comparative fit index (CFI), the root 
mean-square error of approximation (RMSEA), and the 
standardized root mean-square residual (SRMR) [8]. 

The chi-square test evaluates whether the observed covariance
matrix is equal to the model-implied covariance matrix
predicted by the parameters, but it is sensitive to sample
size and model complexity. Thus, other fit indexes, including
the CFI, RMSEA, and SRMR, have been used to evaluate 
model fit [8]. The CFI is relatively insensitive to sample 
size and model complexity, and CFI > 0.95 and CFI > 
0.90 suggest good and acceptable fit, respectively [9]. 
The RMSEA is less sensitive to sample size and favors 
more parsimonious models. An RMSEA < 0.06 repre- 
sents good fit, and an RMSEA < 0.10 yields acceptable 
fit [9]. An SRMR < 0.08 represents a good fit, and an
Figure 1 Modeling the aggregate effects of common and rare variants in FLT1 using latent variable structural equation modeling.

Adding rare variants (B) to the FLT1 latent construct composed of common variants (A) improved the model fit (A: CFI = 0.90, RMSEA = 0.03,
SRMR = 0.08; vs. B: CFI = 0.96, RMSEA = 0.02, SRMR = 0.05) and the variance explained in Q1 (R^2: 0.36 ± 0.04 (B) vs. 0.30 ± 0.04 (A)). Standardized
parameters and standard errors are shown above the arrows. Yellow, rare variant; blue, population substructure (PopStr; principal component, 
PC); red, gene; green, trait. * p < 0.05; ** p < 0.001. Residuals not shown for clarity. 



SRMR < 0.10 represents an acceptable fit [8,9]. We evaluated
the performance of the SEM model by calculating
the mean relative bias in the parameters and their standard
errors across 100 replicates (replicates 99-136 and
138-200) available in the GAW17 data [1]. All p-values
are from two-sided tests.
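
The fit-index cutoffs quoted above can be collected into a small helper. This is only an illustrative sketch: the function name `grade_fit` and the label "poor" for values outside the acceptable range are assumptions, and the cutoffs are applied as strict inequalities exactly as they are stated in the text.

```python
def grade_fit(cfi, rmsea, srmr):
    """Label each index 'good', 'acceptable', or 'poor' using the cutoffs quoted in the text."""
    def grade(value, good_cut, ok_cut, higher_is_better):
        if higher_is_better:
            if value > good_cut:
                return "good"
            return "acceptable" if value > ok_cut else "poor"
        if value < good_cut:
            return "good"
        return "acceptable" if value < ok_cut else "poor"

    return {
        "CFI": grade(cfi, 0.95, 0.90, True),       # CFI > 0.95 good, > 0.90 acceptable
        "RMSEA": grade(rmsea, 0.06, 0.10, False),  # RMSEA < 0.06 good, < 0.10 acceptable
        "SRMR": grade(srmr, 0.08, 0.10, False),    # SRMR < 0.08 good, < 0.10 acceptable
    }

# Fit indexes reported for the FLT1 model with rare variants (Figure 1, panel B).
fig_1b = grade_fit(cfi=0.96, rmsea=0.02, srmr=0.05)
```

Reported indexes are rounded to two decimals, so a value that sits exactly on a cutoff (for example, CFI = 0.90) cannot be graded unambiguously from the published figure alone.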

Results 

Without knowledge of the underlying simulated model 
and using a randomly selected replicate (replicate 137), 
we evaluated potential associations between each SNP 
and trait Q1 in stage 1. We found that several genes
had at least one SNP meeting or exceeding the Bonferroni-corrected
level with (p < 8.33 × 10^-6) and without
(p < 2.04 × 10^-6) collapsing rare variants (MAF < 0.05)
(Table 1), but the most significant associations were
observed with common (C13S522, C13S523) and rare
(C13S524) variants in FLT1 (Table 2).
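
The uncollapsed Bonferroni level is consistent with dividing a family-wise error rate of 0.05 by the 24,487 variants in the data set. The sketch below reproduces that arithmetic; the 0.05 error rate is an assumption inferred from the reported threshold, and the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# 24,487 variants and 3,205 genes are the counts given for the Pilot Project 3 data.
snp_threshold = bonferroni_threshold(0.05, 24487)   # about 2.04e-6
gene_threshold = bonferroni_threshold(0.05, 3205)   # about 1.56e-5

print(f"per-SNP threshold:  {snp_threshold:.2e}")
print(f"per-gene threshold: {gene_threshold:.2e}")
```

Collapsing rare variants reduces the number of tests, which is why the corrected level is less stringent when rare variants are collapsed.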

In stage 2, when building the FLT1 construct using 
replicate 137, we found that adding rare variants to the 
common variants improved the model fit (CFI = 0.90, 
RMSEA = 0.03, and SRMR = 0.08 in Figure 1A vs. CFI 
= 0.96, RMSEA = 0.02, and SRMR = 0.05 in Figure 1B),
improved construct reliability (Cronbach's α: 0.40 (A)
vs. 0.53 (B)), and increased the variance explained in Q1
(R^2: 0.30 ± 0.04 (A) vs. 0.36 ± 0.04 (B)). In a larger SEM
(Figure 2) with 6 genes (26 SNPs) and with population
structure represented by a latent variable (PopStr), we
found that the path coefficient of FLT1 on Q1 (β_std =
0.49 ± 0.04) was slightly lower than that in the reduced
model (Figure 1B: β_std = 0.43 ± 0.05), but FLT1
remained the gene most strongly associated with Q1,



Table 1 Top potentially interesting genes (PIGs) with SNPs associated with Q1 in replicate 137 of GAW17 exon sequencing data (unrelated individuals)
Adjusted model 1 a
Number of SNPs
with p <
2.04 × 10^-6


Number of 
SNPs with 
p < 0.10 


Highest p
Number of SNPs
with p <
2.04 × 10^-6


Number of 
SNPs with 
p < 0.10 


Highest 
P (SNP) 


Number of 
SNPs with p < 
2.04 x 10 -6 


Number of 
SNPs with 
p < 0.10 


Highest 
p (SNP) 


FLT1 


13 


35 


16,389 


2 (C13S522 n, 
C13S523 n) 


10 (8 n, 2 s) 


341 X 10 

-18 

(C13S523) 


3 (C13S522 n, 
C13S523 n, C13S524 

n) 


11 (7 n, 4 s) 


5.64 x 10~ 2 ' 
(C13S423) 


2 (C13S522 n, 
C13S523 n) 


11 (7 n, 4 s) 


2.10 X 10 
-11 

(C13S423) 


FADS3 


11 


9 


15,457 


1 (C11S3071 n) 


1 (1 n) 


2.37 X 10~ 7 
(C11S3071) 


1 (C11S3071 n) 


2 (1 n, 1 s) 


1.33 x 10~ 7 
(C11S3071) 


1 (C11S3071 n) 


1 (1 n) 


8.68 X 1 0~ 7 
(C11S3071) 


C5ORF25


5 


22 


55,401 


1 (C5S4371 n) 


2 (1 n, 1 s) 


3.99 X 10~ 5 
(C5S4371)
2 (1 n, 1 s)


345 X 1 0~ 7
2 (1 n, 1 s)
(C1S11541)


3 (C1S11528 n,
C1S11529 s,
a Adjusted for Age, Sex, Smoking, Pop1.

b Adjusted for Age, Sex, Smoking, Pop1 and top 12 principal components (PCs).
c n = nonsynonymous SNP; s = synonymous SNP



Table 2 Select FLT1 SNPs in GAW17 exon sequencing data (replicate 137; unrelated individuals) and associations with Q1
Minor allele frequency


Crude model 


Adjusted model 1 a 


Adjusted model 2 b
0.0007


0.22 (1.00) 


0.8263 


0.39 (0.94) 


0.6780 


0.42 (0.94) 


0.6546 


0.29 (0.91) 


0.7500 


0.37 (0.91) 


0.6821 


C13S505 


0.0007 


-0.14 (1.00) 


0.8926 


0.25 (0.94) 


0.7933 


0.23 (0.95) 


0.8094 


0.34 (0.91) 


0.7101 


0.41 (0.92) 


0.6583
1.03 (1.00)


0.3052
0.2660


1.08 (0.94)
1.43 (0.91)
1.43 (0.91)
0.8837
0.8673



a Adjusted for Age and Smoking. 

b Adjusted for Age, Smoking, Sex, and Pop1. 

c Adjusted for Age, Smoking, Sex, Pop1 and top 10 principal components (PCs).
d Adjusted for Age, Smoking, Sex, Pop1 and top 12 PCs.

Figure 2 Modeling the aggregate effects of common and rare variants in multiple potentially interesting genes (without knowledge
of the GAW17 answers) using latent variable structural equation modeling. Model of the associations between 7 putative genes (26 SNPs)
and Q1 (Q1 R^2 = 0.36, CFI = 0.90, RMSEA = 0.05, SRMR = 0.07). * p < 0.05; ** p < 0.001. Residuals not shown for clarity.



followed by SPHKAP, LRRN2, C5ORF25, and FADS3.
Genes AKAP13 and OR2T34 were not associated with
Q1. Population structure was not significantly associated
with Q1 or with genes where paths are not shown.
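
Construct reliability for a set of SNPs defining a gene is summarized above with Cronbach's α. A minimal sketch of the usual raw (unstandardized) formula follows; whether the reported values were computed on raw or standardized items is not stated in the text, so the raw version here is an assumption, as are the function name and toy data.

```python
import numpy as np

def cronbach_alpha(items):
    """Raw Cronbach's alpha for an (n_subjects, k_items) array, with k >= 2."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item (SNP)
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

# Hypothetical check: two perfectly correlated items give alpha = 1.
alpha_identical = cronbach_alpha([[0, 0], [1, 1], [2, 2]])
```

Rare variants have very small variances and weak correlations with other SNPs, which is one reason reliability values for constructs that include them tend to be modest.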

In post hoc analyses, we found that the list of PIGs 
varied markedly across replicates (99-136 and 138- 
200), with the exception of FLT1, which had at least one 
SNP in all 100 replicates meeting the Bonferroni-corrected
p-value threshold in models adjusted for 10 or 12 PCs, and
with and without rare variants collapsed. KDR was the 
next most consistent PIG, which was identified in 12 
and 20 of the 100 replicates when rare variants were 
and were not collapsed, respectively; the results were 
the same when adjusting for 10 and 12 PCs. 

We obtained the answers to the GAW17 simulation 
model to better understand the performance of our 
stage 1 approach and to develop a stage 2 model that 
would more closely reflect the simulated model. The 



answers revealed that 39 SNPs in 9 genes (FLT1, FLT4,
KDR, ARNT, ELAVL4, HIF1A, HIF3A, VEGFA, VEGFC),
primarily in the VEGF pathway, were simulated to be
associated with Q1.

With regard to stage 1 performance, we found that 
the number of false-positive genes decreased with 
adjustment for increasing numbers of PCs. As shown in 
Table 3, the number of false-positive genes was lower 
when rare variants were collapsed (8 PCs: μ [mean] =
1.40, SD = 2.38; 10 PCs: μ = 1.25, SD = 2.35; 12 PCs: μ
= 1.20, SD = 2.26) versus not collapsed (8 PCs: μ = 6.46,
SD = 11.99; 10 PCs: μ = 5.69, SD = 11.38; 12 PCs: μ =
5.43, SD = 11.20). The number of true-positive and
false-negative genes was similar for models adjusted for
10 and 12 PCs and when rare variants were and were
not collapsed (Table 3). Relaxing multiple test criteria to
p < 1.56 × 10^-5 (which reflects Bonferroni correction for
the total number of genes) did not materially improve

Table 3 True-positive (TP), false-positive (FP), and false-negative (FN) genes for Q1 over 100 replicates (99-136 and
138-200) in GAW17 exon sequencing data (unrelated individuals)

                     Adjusted model 1 (a)   Adjusted model 2 (b)   Adjusted model 3 (c)   Adjusted model 4 (d)   Adjusted model 5 (e)
                         TP     FP     FN       TP     FP     FN       TP     FP     FN       TP     FP     FN       TP     FP     FN
Mean                   1.80  43.48   7.20     1.23   5.69   7.77     1.23   5.43   7.77     1.13   1.25   7.87     1.13   1.20   7.87
Standard deviation     0.72  26.78   0.72     0.42  11.38   0.42     0.42  11.20   0.42     0.34   2.35   0.34     0.34   2.26   0.34
Range                   1-4  2-122    5-8      1-2   0-43    7-8      1-2   0-42    7-8      1-2   0-14    7-8      1-2   0-14    7-8

a Adjusted for Age, Smoking, Sex, and population (Pop1).
b Adjusted for Age, Smoking, Sex, Pop1 and top 10 PCs.
c Adjusted for Age, Smoking, Sex, Pop1 and top 12 PCs.
d Rare variants collapsed, adjusted for Age, Smoking, Sex, Pop1 and top 10 PCs.
e Rare variants collapsed, adjusted for Age, Smoking, Sex, Pop1 and top 12 PCs.



the number of true-positive genes when rare variants 
were collapsed (not shown). Although we were most 
interested in identifying causal genes in stage 1, we note 
that the number of false-positive SNPs over the 100 
replicates decreased with adjustment for increasing 
numbers of PCs (Pop1: μ = 58.74, SD = 36.34; 8 PCs: μ
= 6.68, SD = 12.38; 10 PCs: μ = 5.89, SD = 11.76; 12
PCs: μ = 5.61, SD = 11.57). The numbers of false-negative
SNPs were similar when adjusting for 10 PCs (μ =
36.31, SD = 1.06) and 12 PCs (μ = 36.32, SD = 1.05)
with most replicates correctly identifying FLT1 SNPs 
C13S522 and C13S523 (not shown). 
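
The gene-level counts summarized above can be obtained by comparing the genes flagged in a replicate with the nine genes simulated to affect the trait. The sketch below shows that bookkeeping; the function name and the example set of flagged genes are hypothetical and do not correspond to any particular replicate.

```python
# The nine genes simulated to be associated with the trait, as listed in the GAW17 answers.
CAUSAL_GENES = frozenset(
    {"FLT1", "FLT4", "KDR", "ARNT", "ELAVL4", "HIF1A", "HIF3A", "VEGFA", "VEGFC"}
)

def classify_genes(detected, causal=CAUSAL_GENES):
    """Return (true positives, false positives, false negatives) as gene counts."""
    detected = set(detected)
    true_pos = detected & causal     # flagged and truly causal
    false_pos = detected - causal    # flagged but not causal
    false_neg = causal - detected    # causal but missed
    return len(true_pos), len(false_pos), len(false_neg)

# Hypothetical replicate in which FLT1 and two non-causal genes reach significance.
tp, fp, fn = classify_genes({"FLT1", "FADS3", "C5ORF25"})
```

Because there are nine causal genes, the true-positive and false-negative counts for a replicate always sum to nine, which is why their standard deviations in Table 3 are identical.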

With regard to building the stage 2 model, because the
GAW17 answers provided only a list of the nine genes
simulated to be associated with Q1, we used the pathway
database of the Kyoto Encyclopedia of Genes and
Genomes (KEGG) (http://www.genome.jp/kegg/pathway.html;
VEGF Signaling, Cytokine-Cytokine Receptor
Interaction, Pathways in Cancer) to better understand
the biological relationships between the nine genes. We
developed a good-fitting model (CFI = 0.90, RMSEA =
0.04, SRMR = 0.03) that included all nine genes simulated
to affect Q1 (Figure 3). The variance explained in
Q1 (R^2 = 0.42) was greater than in prior models. FLT1
remained the gene most strongly associated with Q1,
followed by ARNT, VEGFA, KDR, VEGFC, FLT4, and
HIF3A. Smoking was simulated to be associated with
KDR, but we observed only a marginal association (β_std
= 0.05 ± 0.02, p = 0.08; not shown) and found that
Smoking was more highly associated with HIF3A and
ELAVL4. Modeling all nine genes simultaneously
revealed that HIF1A was associated with VEGFC but
that ELAVL4 and HIF1A were not associated with Q1.
Removing paths designated by a dashed line (Figure 3)
resulted in a slightly improved model fit (CFI = 0.91,
RMSEA = 0.04, SRMR = 0.02), but the magnitude of the
paths from genes to Q1 remained similar.

To evaluate the performance of the stage 2 SEM 
model, we used estimates from a randomly selected 
replicate (replicate 137) to represent population values 



(because the GAW17 answers did not contain standar- 
dized estimates for aggregate gene effects that we could 
directly compare to) and compared these population 
values to estimates from 100 replicates (replicates 99- 
136 and 138-200) available in the GAW17 data [1]. We 
found that the mean relative bias (MRB) in the parameters
and in the standard errors across 100 replicates
was 4.27% and 4.69%, respectively. The MRBs in standardized
loadings and residuals were 1.47% for λ, 0.09%
for ε, and 0.57% for ψ; the MRBs in the standard errors
(SEs) were 0.32% for λ, 0.98% for ε, and 2.19% for ψ.
The MRB was generally similar between common and
rare variants. For example, for the FLT1 common SNP
C13S523 (MAF = 0.07) and the FLT1 rare SNP
C13S524 (MAF = 0.004), the MRB in λ was 0.35% for
C13S523 vs. 0.62% for C13S524, and the MRB in the SE
of λ was 0.33% for C13S523 vs. 0.98% for C13S524. The
MRB of ε was 0.31% in C13S523 and 0.22% in C13S524,
and the MRB of the SE of ε was 0.63% in C13S523 and
0.20% in C13S524. The largest bias was observed in the path
coefficients (β = 19.9%; SE of β = 9.79%), which was
quite severe in some cases, such as the HIF1A path
coefficient, where the bias reached 67.94%. Interestingly,
the post hoc analysis revealed that genes represented by
private variants, such as HIF1A, that were associated
with Q1 in replicate 137 were not significantly associated
with Q1 in most replicates. Also, the effects of
covariates (Age, Smoking, Sex, Pop1, PCs) varied markedly
across replicates. Model fit, however, was generally
consistent across replicates, with the average CFI,
RMSEA, and SRMR being 0.91, 0.04, and 0.02,
respectively.
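
The mean relative bias reported above compares each replicate's estimate with a reference ("population") value. A minimal sketch of one common definition follows. The exact formula used in the analysis is not given in the text, so the signed percentage form below is an assumption, as are the function name and the example numbers.

```python
import numpy as np

def mean_relative_bias(estimates, population_value):
    """Mean of (estimate - population) / population across replicates, in percent."""
    estimates = np.asarray(estimates, dtype=float)
    return 100.0 * np.mean((estimates - population_value) / population_value)

# Hypothetical example: a reference loading of 0.40 and three replicate estimates.
mrb = mean_relative_bias([0.38, 0.42, 0.41], population_value=0.40)
```

With a signed definition, over- and underestimates cancel across replicates, so a small MRB indicates little systematic bias rather than little replicate-to-replicate variability.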

Discussion 

In stage 1, adjusting for 10 or 12 PCs (see also Qin et 
al. [5]) and collapsing rare variants using the indicator 
coding method decreased the number of false-positive 
genes by about 78% (1.2 vs. 5.4), on average, but the 
number of false-negative genes remained high regard- 
less of whether rare variants were collapsed or not
Figure 3 Modeling the aggregate effects of common and rare variants in multiple genes (with knowledge of the answers) using
latent variable structural equation modeling. Model of the associations between 9 genes (19 SNPs) simulated to affect Q1 (Q1 R^2 = 0.42, CFI
= 0.90, RMSEA = 0.04, SRMR = 0.03). * p < 0.10; ** p < 0.05; *** p < 0.01. Residuals and paths from population structure not shown for clarity. 



(7.9 vs. 7.8). This is striking because we missed identi- 
fying about 87% of the simulated causal genes and cor- 
rectly identified only one gene (11.1%; FLT1) over all 
100 replicates (replicates 99-136 and 138-200). Our 
stage 2 SEM results were able to confirm the impor- 
tance of FLT1, because irrespective of the other genes 
included in the model (i.e., the false-positive model in 
Figure 2 and the answer-driven model in Figure 3), the 
FLT1 construct consistently had the strongest association
with Q1 in replicate 137 and across all other
replicates (replicates 99-136 and 138-200). We found 
that the MRB in the answer-driven stage 2 SEM mod- 
el's parameters and standard errors across 100 repli- 
cates was less than 5%. In addition, our stage 2 SEM 
model (Figure 3) revealed relationships between genes 
(e.g., ARNT and FLT1) and between covariates and 
genes (e.g., Smoking and ELAVL4) that were not
discussed in the GAW17 answers. Thus, we believe
that latent variable SEM, which models all nine genes
simultaneously with the relevant environmental factors and
population structure in a hierarchical manner that better
reflects the underlying biology, provides an improved
understanding of each gene's relevance to disease
pathophysiology compared with standard multiple
regression methods.

Conclusions 

Our latent gene construct approach provides a viable 
framework for evaluating the aggregate effects of rare 
and common variants in multiple genes on a trait while 
adjusting for population substructure; however, more 
elegant methods are needed in stage 1 to minimize false 
positives and concomitantly improve identification of 
true-positive genes.